Replacing NA Values





Background



Between the years 1978-2003, the majority of the data pulled in from SNOTEL had NA snow depth values. However, most of them had accompanying snow water equivalent (SWE) values, which can be used to predict snow depth. This is because there is always a ratio relationship. This ratio is usually between 3:1 and 7:1, where for every 1 inch of precipitation, there is 3 to 7 inches of snowpack.

The purpose of the scripts in this directory is to show the process of coming up with an appropriate model to estimate the values that are currently missing, or NA. The purpose of this document is to summarize that process.





Step 1: Obtain Station Metadata


get_snotel_metadata.R





Metadata obtained - County, Elevation, Coordinates, etc. It was obtained thanks to the library(snotelr). I filtered for only Colorado stations with elevation in feet.

Datatables will appropiately be shown.






Step 2: Upload Weather Data/Join With Metadata


join_metadata_weatherdata.R





From the website linked in the R script, I download snotel data queried to my specifications (snow depth, temperature, etc). This is shown as snow_data. Then I joined that dataset with the previously uploaded and catored metadata, which will give me both climate observations and weather station info, because the metadata has the elevation and station location, but the weather data had the observations, and I need them together.

Because of the number of columns makes it hard to display, a picture of the data will be provided.



Find Pic in same Github Repository

Find Pic in same Github Repository





Step 3: Format Dates


wrangle_weatherdata.R





The main task of this script was to create a date column out of the provided date information from the weather/metadata joined table.

The joined table had a month column, where it had information about how the data was pulled in semi-monthly quantities, displaying phrases like Aug 1st half. It also provided a year column. The script uses a series of regex commands to create a date column. The following is a picture of the data.



Find Pic in same Github Repository

Find Pic in same Github Repository





Decision: Replacing NA Values


na_year_station_check.R


month_ratio.R


elevation_ratio.R





At this point, and before making any further steps to make visualizations, we have to address the NA problem. How bad is the problem? If we take out the NA from the snow depth column, our dataset gets cut in half, from about 70K to 35K rows.

Also, when does the NA data appear? Which years?





As shown by the graphic, most of the first years of observations are completely NA. This means that we will have to use the SWE (start_snow_water) to predict all snowpack numbers for those years.

This brings up these questions: Is there another datasource that can provide this data? How accurately can we predict years worth of missing data?

In order to answer both of these questions, we’ll have to explore and see. In terms of answering whether or not we can have confidence in predicted data, we’ll have to predict it first.

We can predict the snow depth by knowing a simple snow depth/SWE, known as the snow ratio. If the ratio is 10:1, this means that for every 1 inch of precipitation it equates to 10 inches of snow. In the equation below, the X is the ratio or 10/1.





Even if we have 35,000 NA values for snow depth, basically comprising the first 20 years, we still have 67,000 SWE values, meaning that if we can predict the ratio right given the circumstances (month, temperature, elevation), we will have a snow depth prediction, thanks to the formula above.



Does Month Affect the Ratio?



Like I stated above, it would not be wise to suppose that the ratio is constant through all months, years, elevations, and temperatures. The snow ratio is likely to fluctuate depending on these, because snow’s size is affected by temperature, collective weight, etc. Is it affected by month? All of the following graphics are used where NA values have been taken out, for obvious reasons that we need numbers in order to make predictions.Remember: each dot is a semi-monthly observation.





There is clearly a possibility that snow ratio is different between months, probably due to temperature, depth of snow, etc. A higher ratio here basically means that given a specific precipitation amount a storm will produce, there will be more snowfall, because its density is smaller.

Let’s look at it in the standpoint of a regression. We want to predict the snow depth based off the SWE, which might differ because of month, and therefore would have different slopes.





There is a possibility that each month has a different slope, and that model will be explained further in na_regression_month.R.



Does Year Affect the Ratio?



Here, each dot represents a station. Snow ratio isn’t affected by year. Though, it might be worth taking a look at the fact that most of the extraneous values happened pre-2010. In fact, we may consider not putting the outlying values in our prediction model (they probably are measuring/reporting errors), and using the model to predict for them new values.





Does Elevation Affect the Ratio?



It seems that elevation doesn’t affect the snow ratio





Anomalies: 10:1 Snow Ratio



odd_ratio_check.R



As seen on the last boxplot, there are many 10:1 snow ratio observations that seem out of place.

These 10:1 observations may not have a yearly pattern, but there may be an explanation for the unusual spike in 2014, when 48 observations were made.





Summarizing



Our problem is that we have a host of NA values in our snow depth column, but we can use our almost complete SWE column to predict those NA snow depth values. We can do this knowing that the two are usually related linearly. However, this ratio relationship may not be constant with elevation, temperature, and month influences.

The previous sections suggest that the snow ratio could be influenced by month, and that it is probably worth a shot performing a regression analysis on it, as well conducting another with temperature as a variable.

It may also be worth it to investigate some of the outlying plot features and patterns which may not directly relate to our NA value replacements, but may have significance of its own.





Step 4: Prep Data For Regression


na_regression_prep.R





Since the purpose of regression is to find the best pattern to predict missing values, I am also under the assumption that we want to predict data as if it was in the best case scenario. That means we will eliminate any obvious outliers and other forms of data that would prevent us from accomplishing this. And by outlier, I am mostly referring to observations that are so far from what would make sense (15:1 ratio) that it probably was a measurement/reporting error, and we are more concerned with the general pattern, which will give us the most insight. This includes:





Data Validation For Regression





The following regressions were performed using validation principles. Shortly, if we split the data up into 2, and fit a regression to both models, we can better tell how well the data can be fit by the consistency of the regressions fitted. If the regressions have similar errors and shape, then we can better confidence in the model. If not, then we may want to collect more data, take the average of the models, or pick another model.





Step 5: Fitting the Regressions



Temperature Data Problem


na_tempdata_analysis.R



As expected with weather station observations, there is error, and some temperature observations are obviously not possible, for example, the wild basin station, January 15-31, 2006, reported an average high temp of 97.

The black dots represent the top .5% mean max temperature of each month (remember that each dot represents a 2-week period), and are mostly those extraneous values. Those values I consider most innaccurate would be in the winter months, because any reading above 85 degrees is more likely in the summer.





Simple Linear Regression


na_simple_regression.R



Before I conducted a multiple regression, I conducted a simple linear regression, water equivalent upon snow depth, using model validation techniques. Since we are not choosing this model, I will not go into too much depth about it. However, I will provide that the highest adjusted r-squared correlation was .871, which was calculated using a model who’s Y values and water equivalent values were both transformed, or raised to a different power, in order to make sure the data was in the correct shape for a linear regression.

The following are the regressions created, each with different numeric tranformations.





Month Regression



na_regression_month.R



After fitting the regression using month as a factor, with many interractions testing significant (creating the different slopes per factor), I also decided to fit a square root tranformation on both the snow depth and SWE columns.



Temperature Regression



na_regression_temp.R



Since the temperature variable is not a factor, the regression only fits one line. After fitting a line with an interraction involved, I also tried a .8 transformation on snow depth, and a square root transformation on SWE, but the highest correlated and seemingly best fit data was the regression without transformations.

The following plot shows that increased temperature decreases the slope, which is also the ratio. Remember that the slope is actually the snow ratio.

Maybe the best way to read this 3D plot is to hit the icon above that reads orbital rotation as you hover over it, and move the graph in a circle while the Snow Depth axis is vertical. This way, you will notice that the lower temperature produces steeper slopes, meaning higher snow ratio and that the slope changes with the temperature, meaning the model picked with an interaction could be correct. (Data may take a moment to load)







Comparing Month & Temperature Regressions



na_prediction_prep.R